Contextual Representation using Recurrent Neural Network Hidden State for Statistical Parametric Speech Synthesis
Abstract
In this paper, we propose to use the hidden state vector obtained from a recurrent neural network (RNN) as a context vector representation for deep neural network (DNN) based statistical parametric speech synthesis. In a typical DNN-based system there is a hierarchy of text features from the phone level to the utterance level, but they are usually in a 1-hot-k encoded representation. Our hypothesis is that supplementing the conventional text features with a continuous, frame-level, acoustically guided representation would improve the acoustic modeling. The hidden state of an RNN trained to predict acoustic features is used as this additional contextual information. A dataset consisting of two Indian languages (Telugu and Hindi) from the Blizzard Challenge 2015 was used in our experiments. Both the subjective listening tests and the objective scores indicate that the proposed approach performs significantly better than the baseline DNN system.
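The following is a minimal sketch, not the authors' implementation, of the overall arrangement described in the abstract, written in PyTorch with hypothetical feature dimensions: a context RNN is trained to predict acoustic features from frame-level text features, and its hidden state is then concatenated with the conventional text features to form the input of the DNN acoustic model. The names ContextRNN and AcousticDNN, all layer sizes, and the training details are illustrative assumptions.

import torch
import torch.nn as nn

# Hypothetical dimensions; the paper's actual linguistic and acoustic
# feature sizes are not reproduced here.
TEXT_DIM, ACOUSTIC_DIM, HIDDEN_DIM = 300, 187, 128

class ContextRNN(nn.Module):
    """RNN trained to predict acoustic features; its hidden state is
    reused later as a continuous, acoustically guided context vector."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(TEXT_DIM, HIDDEN_DIM, batch_first=True)
        self.out = nn.Linear(HIDDEN_DIM, ACOUSTIC_DIM)

    def forward(self, text_feats):
        h, _ = self.rnn(text_feats)      # h: (batch, frames, HIDDEN_DIM)
        return self.out(h), h            # acoustic prediction + hidden states

class AcousticDNN(nn.Module):
    """Feed-forward acoustic model whose input is the conventional text
    features concatenated with the RNN hidden state."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(TEXT_DIM + HIDDEN_DIM, 512), nn.Tanh(),
            nn.Linear(512, 512), nn.Tanh(),
            nn.Linear(512, ACOUSTIC_DIM),
        )

    def forward(self, text_feats, rnn_state):
        return self.net(torch.cat([text_feats, rnn_state], dim=-1))

# Toy batch: 2 utterances of 100 frames each, random stand-in data.
text = torch.randn(2, 100, TEXT_DIM)
target = torch.randn(2, 100, ACOUSTIC_DIM)

context_rnn = ContextRNN()
pred, hidden = context_rnn(text)
rnn_loss = nn.functional.mse_loss(pred, target)   # step 1: train the context RNN

dnn = AcousticDNN()
acoustic = dnn(text, hidden.detach())             # step 2: DNN sees text + RNN context
dnn_loss = nn.functional.mse_loss(acoustic, target)
dnn_loss.backward()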
Similar resources
Unit Selection with Hierarchical Cascaded Long Short Term Memory Bidirectional Recurrent Neural Nets
Bidirectional recurrent neural nets have demonstrated state-of-the-art performance for parametric speech synthesis. In this paper, we introduce a top-down application of recurrent neural net models to unit-selection synthesis. A hierarchical cascaded network graph predicts context phone duration, speech unit encoding and frame-level logF0 information that serves as targets for the search of unit...
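A rough sketch of such a cascaded, top-down predictor is given below, again in PyTorch with invented dimensions; the actual network graph, linguistic feature set and unit encoding of the cited work are not reproduced. A phone-level bidirectional LSTM predicts a duration and a unit encoding per phone, which are repeated to frame rate and fed to a frame-level bidirectional LSTM that predicts logF0 targets.

import torch
import torch.nn as nn

# Hypothetical sizes; the cited paper's feature and embedding
# dimensionalities are assumptions here.
LING_DIM, UNIT_DIM, HID = 300, 64, 128

class CascadedTargets(nn.Module):
    """Cascaded bidirectional RNNs: phone-level BLSTM -> duration and
    unit encoding; frame-level BLSTM -> logF0 targets."""
    def __init__(self):
        super().__init__()
        self.phone_rnn = nn.LSTM(LING_DIM, HID, batch_first=True, bidirectional=True)
        self.dur_head = nn.Linear(2 * HID, 1)          # phone duration (frames)
        self.unit_head = nn.Linear(2 * HID, UNIT_DIM)  # speech-unit encoding
        self.frame_rnn = nn.LSTM(LING_DIM + UNIT_DIM, HID,
                                 batch_first=True, bidirectional=True)
        self.f0_head = nn.Linear(2 * HID, 1)           # frame-level logF0

    def forward(self, phone_feats, frames_per_phone):
        h, _ = self.phone_rnn(phone_feats)
        dur = self.dur_head(h)
        unit = self.unit_head(h)
        # Repeat phone-level outputs to frame rate before the second stage.
        frame_in = torch.cat([phone_feats, unit], dim=-1)
        frame_in = frame_in.repeat_interleave(frames_per_phone, dim=1)
        g, _ = self.frame_rnn(frame_in)
        return dur, unit, self.f0_head(g)

model = CascadedTargets()
phones = torch.randn(1, 12, LING_DIM)    # one utterance of 12 phones
dur, unit, logf0 = model(phones, frames_per_phone=10)
print(dur.shape, unit.shape, logf0.shape)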
Statistical Parametric Speech Synthesis Using Bottleneck Representation From Sequence Auto-encoder
In this paper, we describe a statistical parametric speech synthesis approach with unit-level acoustic representation. In conventional deep neural network based speech synthesis, the input text features are repeated for the entire duration of a phoneme for mapping text and speech parameters. This mapping is learnt at the frame level, which is the de-facto acoustic representation. However much of t...
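A minimal sequence auto-encoder sketch in PyTorch, with assumed dimensions, illustrating the general idea of compressing the frames of a unit into a fixed-length bottleneck vector and reconstructing them from it; the actual architecture and bottleneck size of the cited work may differ.

import torch
import torch.nn as nn

# Hypothetical sizes; the actual acoustic parameterisation and bottleneck
# width used in the cited work are assumptions here.
ACOUSTIC_DIM, BOTTLENECK = 187, 32

class SequenceAutoEncoder(nn.Module):
    """The final encoder state is squeezed through a bottleneck and used
    as a fixed-length, unit-level acoustic representation; the decoder
    reconstructs the frame sequence from that code."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.GRU(ACOUSTIC_DIM, 128, batch_first=True)
        self.to_code = nn.Linear(128, BOTTLENECK)
        self.from_code = nn.Linear(BOTTLENECK, 128)
        self.decoder = nn.GRU(ACOUSTIC_DIM, 128, batch_first=True)
        self.out = nn.Linear(128, ACOUSTIC_DIM)

    def forward(self, frames):
        _, h = self.encoder(frames)                 # h: (1, batch, 128)
        code = self.to_code(h[-1])                  # unit-level bottleneck
        h0 = self.from_code(code).unsqueeze(0)      # init decoder with the code
        # Teacher forcing with the shifted input frames, for brevity.
        dec_in = torch.cat([torch.zeros_like(frames[:, :1]), frames[:, :-1]], dim=1)
        d, _ = self.decoder(dec_in, h0.contiguous())
        return self.out(d), code

model = SequenceAutoEncoder()
phone_frames = torch.randn(4, 25, ACOUSTIC_DIM)     # 4 units, 25 frames each
recon, code = model(phone_frames)
loss = nn.functional.mse_loss(recon, phone_frames)  # train to reconstruct
print(code.shape)                                   # (4, 32) unit-level codes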
Acoustic Modeling in Statistical Parametric Speech Synthesis – from HMM to LSTM-RNN
Statistical parametric speech synthesis (SPSS) combines an acoustic model and a vocoder to render speech given a text. Typically decision tree-clustered context-dependent hidden Markov models (HMMs) are employed as the acoustic model, which represent a relationship between linguistic and acoustic features. Recently, artificial neural network-based acoustic models, such as deep neural networks, ...
Fundamental Frequency Modelling: An Articulatory Perspective with Target Approximation and Deep Learning
Current statistical parametric speech synthesis (SPSS) approaches typically aim at state/frame-level acoustic modelling, which leads to a problem of frame-by-frame independence. Besides that, whichever learning technique is used, hidden Markov model (HMM), deep neural network (DNN) or recurrent neural network (RNN), the fundamental idea is to set up a direct mapping from linguistic to acoustic ...
Model-Based Parametric Prosody Synthesis with Deep Neural Network
Conventional statistical parametric speech synthesis (SPSS) captures only frame-wise acoustic observations and computes probability densities at HMM state level to obtain statistical acoustic models combined with decision trees, which is therefore a purely statistical data-driven approach without explicit integration of any articulatory mechanisms found in speech production research. The presen...
Publication year: 2016